Prediction of outcomes after cardiac arrest by a generative artificial intelligence model

Aims To investigate the prognostic accuracy of a non-medical generative artificial intelligence model (Chat Generative Pre-Trained Transformer 4, ChatGPT-4) as a novel approach to predicting death and poor neurological outcome at hospital discharge based on real-life data from cardiac arrest patients.
Methods This prospective cohort study investigated the prognostic performance of ChatGPT-4 in predicting outcomes at hospital discharge of adult cardiac arrest patients admitted to intensive care at a large Swiss tertiary academic medical center (COMMUNICATE/PROPHETIC cohort study). We prompted ChatGPT-4 with sixteen prognostic parameters derived from established post-cardiac arrest scores for each patient. We compared the prognostic performance of ChatGPT-4, in terms of the area under the curve (AUC), sensitivity, specificity, positive and negative predictive values, and likelihood ratios, with that of three cardiac arrest scores (Out-of-Hospital Cardiac Arrest [OHCA], Cardiac Arrest Hospital Prognosis [CAHP], and PROgnostication using LOGistic regression model for Unselected adult cardiac arrest patients in the Early stages [PROLOGUE score]) for in-hospital mortality and poor neurological outcome.
Results Mortality at hospital discharge was 43% (n = 309/713), and 54% of patients (n = 387/713) had a poor neurological outcome. ChatGPT-4 showed good discrimination regarding in-hospital mortality with an AUC of 0.85, similar to the OHCA, CAHP, and PROLOGUE scores (AUCs of 0.82, 0.83, and 0.84, respectively). For poor neurological outcome, ChatGPT-4 showed a similar prediction to the post-cardiac arrest scores (AUC 0.83).
Conclusions ChatGPT-4 showed a similar performance in predicting mortality and poor neurological outcome compared to validated post-cardiac arrest scores. However, more research is needed regarding illogical answers before an LLM can potentially be incorporated into multimodal outcome prognostication after cardiac arrest.


Introduction
Most deaths in cardiac arrest survivors occur due to the withdrawal of life-sustaining therapies (WLST) when a poor neurological outcome is assumed. 4,5 Hence, some cardiac arrest patients with a chance of substantial neurological recovery are at risk of premature WLST. 2,4,5 Consequently, the present post-resuscitation care guidelines recommend a multimodal approach and delaying prognostication for at least 72 hours to decrease the risk of premature WLST. 6 However, the multimodal approach does not integrate individual parameters (such as the time until the return of spontaneous circulation [ROSC] or lactate levels), as the predictive performance of individual parameters is limited. 7 Therefore, it has been recommended to integrate several parameters into validated post-cardiac arrest scores, although these scores still have limited prognostic performance. 10,11
Artificial intelligence (AI) in its wider form might bring additional prognostic possibilities, as supervised machine learning algorithms in the form of artificial neural networks have shown promising prognostic performance in cardiac arrest patients. 12,13 Large generative AI language models have recently gained worldwide attention with the release of Chat Generative Pre-trained Transformer 4 (ChatGPT-4), 14 which is capable of deductive reasoning and writing complex texts about a wide range of topics. 15,16 Unlike other large language models (LLMs), 25 the system was not developed for healthcare purposes. There are some recent studies using ChatGPT-4 as a medical decision aid in the acute care setting, for example, in the triage of patients in the emergency room. 26,27 However, the value of ChatGPT-4 for the prognostication of short-term outcomes in cardiac arrest patients remains unclear. To the best of our knowledge, there are currently no studies evaluating the value of LLMs for prognostication in patients after cardiac arrest. However, the potential of LLMs is promising, especially as LLMs might be provided with unstructured medical data. 28

Study setting & participants
At the University Hospital Basel, a Swiss tertiary teaching hospital and cardiac arrest center, adult in-hospital cardiac arrest (IHCA) and out-of-hospital cardiac arrest (OHCA) patients admitted to the ICU were consecutively included in an ongoing prospective cohort study to assess prognostication after cardiac arrest and long-term outcomes. 41 The data analyzed in the present study was prospectively collected from October 2012 until December 2022. The data collection, analysis, and reporting complied with the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guidelines and the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) statement. 42,43

Ethics
The prospective cohort study has been approved by the local ethics committee (Ethikkommission Nordwest- und Zentralschweiz, EKNZ; https://www.eknz.ch) and was conducted in compliance with the Declaration of Helsinki and its amendments. 44 Informed consent was primarily obtained from patients directly. In patients without the capacity of judgment, informed consent was obtained from surrogate decision-makers according to Swiss legal regulations.

Data collection and measures
Data was prospectively collected from the digital ICU patient-data management system and the medical records of the University Hospital Basel. The following data was collected for the purpose of this study:

Post-cardiac arrest scores
The predictive performance of ChatGPT-4 was compared to three post-cardiac arrest scores that can be used to predict outcomes after cardiac arrest: the OHCA score, the Cardiac Arrest Hospital Prognosis (CAHP) score, and the PROLOGUE score (PROgnostication using LOGistic regression model for Unselected adult cardiac arrest patients in the Early stages). All three scoring systems have been repeatedly validated. 8,30,45 The scores integrate different parameters that have been associated with outcomes after cardiac arrest: personal, cardiac arrest-related, and clinical/laboratory parameters upon hospital and/or ICU admission. 11

Outcomes
The primary outcome was defined as in-hospital mortality. The secondary outcome was poor neurological outcome at hospital discharge measured by the Cerebral Performance Category (CPC), which is recommended by international expert consensus. 46,47 The CPC system classifies the neurological outcome after cardiac arrest into five different levels: CPC = 1: Good neurological recovery; CPC = 2: Moderate cerebral disability; CPC = 3: Severe cerebral disability; CPC = 4: Persistent vegetative state or coma; CPC = 5: Death including brain death. 48 In accordance with expert consensus and previous research in the field, the neurological outcome was then dichotomized into good outcome (CPC 1-2) and poor outcome (CPC 3-5). 46,47
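The dichotomization described above can be illustrated with a short sketch. The function name `dichotomize_cpc` is a hypothetical helper introduced here for illustration only; it is not part of the study's analysis code.

```python
def dichotomize_cpc(cpc: int) -> str:
    """Map a Cerebral Performance Category (1-5) to 'good' (CPC 1-2) or 'poor' (CPC 3-5)."""
    if cpc not in (1, 2, 3, 4, 5):
        raise ValueError(f"CPC must be an integer from 1 to 5, got {cpc!r}")
    return "good" if cpc <= 2 else "poor"


# CPC 5 (death) counts as a poor neurological outcome under this definition.
outcomes = [dichotomize_cpc(c) for c in (1, 2, 3, 4, 5)]
```

Because CPC 5 falls into the poor-outcome group, every in-hospital death is also a poor neurological outcome, which is why the number of poor outcomes can never be smaller than the number of deaths.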

Development of the chat prompt and data extraction from ChatGPT-4
For the development of a standardized chat prompt, we utilized an iterative approach as suggested by Kanjee et al. 17 An introductory text was drafted and refined by trial and error until the desired responses were given by ChatGPT-4. The introductory text rigorously explained the task and the setting to the LLM. The complete standardized chat prompt can be obtained from the online-only supplement (eMethods 1). In brief, the LLM was asked to put itself into the position of an 'AI intensive care doctor' receiving a cardiac arrest patient with ROSC in its intensive care unit. The LLM was also provided with sixteen patient-related parameters. These were selected because they are well-known predictors of outcomes after cardiac arrest and are all included in one or more of the post-cardiac arrest scores (OHCA, CAHP, PROLOGUE). Furthermore, uploading unstructured data in the form of medical charts to a cloud-based LLM would cause significant issues regarding data privacy. The following sixteen parameters were provided: age, sex, observed cardiac arrest, setting, initial rhythm, no-flow time, low-flow time, epinephrine administration during resuscitation, pH at ICU admission, potassium level at ICU admission, lactate level at ICU admission, haemoglobin level at ICU admission, phosphate level at ICU admission, creatinine level at ICU admission, pupillary light reflex at ICU admission, and GCS motor score at ICU admission. The LLM was then asked to provide replies to the following two questions:
- Will this patient survive to hospital discharge? Please provide a yes/no answer and the probability of survival in percent.
- Will this patient experience a good neurological outcome at hospital discharge as defined by a cerebral performance category scale of 1 or 2? Please provide a yes/no answer and the probability of a good neurological outcome in percent.
The chat prompt for each patient was generated by a preprogrammed Excel (Microsoft, Redmond, Washington, USA) spreadsheet (eMethods 2), which combined the standardized chat prompt with the cardiac arrest parameters of each patient. This allowed the whole chat prompt to be copy-pasted in a single command, thereby reducing the possibility of erroneous data entries.
The LLM's answers to the questions were then registered in a separate Excel (Microsoft, Redmond, Washington, USA) spreadsheet. We ensured that the LLM assessed each patient individually by opening a new chat after each patient. In total, we performed three runs so that each patient was assessed three times by the LLM. Regarding the dichotomous yes/no answers, the most frequent answer of the three runs was counted, e.g., if the individual answers were yes/yes/no, the overall answer was registered as yes. Regarding the probability of survival and the probability of good neurological outcome in percent, the mean value of the three runs was used for statistical analysis. All chat prompts, including answers, have been thoroughly documented by screenshots. If the LLM provided illogical answers (i.e., hallucinations), such as providing a higher probability of survival with a good neurological outcome than of survival, the LLM was asked to reconsider its answer, also using a standardized text input. For the statistical analysis, the corrected, logical answers were used.
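The aggregation of the three runs and the check for illogical answers can be sketched as follows. This is a minimal illustration under stated assumptions: the answers in the study were recorded in a spreadsheet, and the function names `aggregate_runs` and `is_illogical` as well as the example values are hypothetical.

```python
from collections import Counter
from statistics import mean


def aggregate_runs(answers, probabilities):
    """Combine three runs into the most frequent yes/no answer and the mean probability (percent)."""
    if len(answers) != 3 or len(probabilities) != 3:
        raise ValueError("expected exactly three runs per patient")
    majority = Counter(answers).most_common(1)[0][0]
    return majority, mean(probabilities)


def is_illogical(p_survival, p_good_neuro):
    """A good neurological outcome implies survival, so its probability cannot exceed that of survival."""
    return p_good_neuro > p_survival


# Made-up example values for one patient (not study data).
answer, p_survival = aggregate_runs(["yes", "yes", "no"], [60, 70, 80])
needs_correction = is_illogical(p_survival=40, p_good_neuro=55)
```

With three runs and a dichotomous answer, a tie is impossible, so the most frequent answer is always well defined.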

Statistical analysis
To characterize the patient cohort, descriptive statistics, including means (±SD), were used for continuous variables, whereas frequencies were reported for binary or categorical variables. Receiver operating characteristic (ROC) curves were constructed and the corresponding areas under the curve (AUC) calculated to evaluate the prognostic performance of ChatGPT-4 in predicting outcomes and to compare it to the OHCA, CAHP, and PROLOGUE scores. We calculated sensitivity, specificity, positive and negative predictive values, and likelihood ratios for mortality and poor neurological outcome predicted by ChatGPT-4. Missing data was handled by multiple imputation based on chained equations to enhance the completeness of the dataset and mitigate biases arising from missing data. Imputations were calculated using multiple covariables (i.e., socio-demographics, comorbidities, resuscitation information, vital signs), including main outcomes (death, neurological outcome), as suggested by Sterne et al. 49 STATA 15.0 was used for statistical analyses, and a two-sided p-value of <0.05 was considered significant.
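To make the AUC concrete: it equals the probability that a randomly chosen patient with the outcome receives a higher predicted risk than a randomly chosen patient without it, with ties counted as one half. The sketch below implements this pairwise definition in plain Python. It is an illustration only, not the STATA routine used in the study, and the example scores and labels are made up.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of cases (label 1) and non-cases (label 0)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    non_cases = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not non_cases:
        raise ValueError("need at least one case and one non-case")
    # A case ranked above a non-case counts 1, a tie counts 0.5, otherwise 0.
    wins = sum(
        1.0 if c > n else 0.5 if c == n else 0.0
        for c in cases
        for n in non_cases
    )
    return wins / (len(cases) * len(non_cases))


# Made-up predicted probabilities of death (percent) and observed deaths (1 = died).
example_auc = auc([90, 80, 70, 30, 20, 60], [1, 1, 0, 0, 0, 1])
```

In this toy example, eight of the nine case/non-case pairs are ranked correctly, giving an AUC of 8/9.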

Baseline characteristics
Of the 713 included patients, 309 patients died in hospital, and 387 had a poor neurological outcome (including CPC 5 = death) at hospital discharge. The baseline characteristics of the cohort overall and stratified based on survival status are shown in Table 1. Factors significantly associated with mortality were higher age, pre-existing comorbidities (e.g., diabetes, chronic obstructive pulmonary disease, malignant disease), cardiac arrest at home, unwitnessed arrest, non-shockable initial heart rhythm, longer time to ROSC, no bystander CPR, longer no-flow and low-flow time, higher doses of epinephrine during resuscitation, non-reactive pupils, and a low Glasgow Coma Scale motor score at ICU admission.
In addition to the probabilities, we also examined the prediction of mortality as a binary outcome. ChatGPT-4 predicted death in 229 patients and survival in 484 patients. Overall, ChatGPT-4's positive predictive value (PPV) was 85% (194/229) and its negative predictive value (NPV) was 76% (369/484), with a sensitivity of 63% and a specificity of 91% (Table 2).
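The reported values follow from the underlying two-by-two table, which can be reconstructed from the counts given above (229 predicted deaths of which 194 were correct, 484 predicted survivors of which 369 were correct). The sketch below recomputes them. The function name `binary_metrics` is a hypothetical helper, and the likelihood ratios are derived here from the same counts rather than taken from Table 2.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard diagnostic accuracy measures from a two-by-two table (death = positive)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "lr_positive": sensitivity / (1 - specificity),
        "lr_negative": (1 - sensitivity) / specificity,
    }


# True positives: predicted death and died; true negatives: predicted survival and survived.
m = binary_metrics(tp=194, fp=229 - 194, fn=484 - 369, tn=369)
rounded = {name: round(value, 2) for name, value in m.items()}
```

The counts are internally consistent: 194 + 115 = 309 deaths and 35 + 369 = 404 survivors, which add up to the 713 included patients.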

Hallucinations of ChatGPT-4 concerning the prediction of probabilities
In all three runs of the ChatGPT-4 experiment, instances of hallucinations occurred in the form of irrational responses to the input prompts provided to ChatGPT-4. Specifically, we observed irrational responses in 59 out of 713 cases (8.3%), 94 out of 713 cases (13.2%), and 100 out of 713 cases (14.0%) in the first, second, and third run, respectively. When directly entering a standardized prompt requesting a correction, all illogical responses were subsequently replaced with logical and coherent answers. The prognostic performance of the uncorrected prediction, however, was similar to the final results regarding mortality (AUROC of 0.84) and inferior regarding neurological outcome (AUROC of 0.75).

Discussion
This study compared the prognostic value of a large language model (ChatGPT-4) for prognostication in cardiac arrest patients with that of well-validated and established cardiac arrest scores. The prognostic performance of ChatGPT-4 for predicting mortality and poor neurological outcomes was good and in the range of the validated post-cardiac arrest scores, demonstrating the potential capabilities of artificial intelligence in clinical practice. However, some findings need further discussion.
First, in about 14% (300/2139) of chat queries, the untrained ChatGPT-4 generated illogical answers (i.e., hallucinations), such as a higher probability of a good neurological outcome than of survival. Here, we asked ChatGPT-4 to reconsider and correct the prediction, which was done without generating further illogical answers. This illustrates that artificial intelligence still may be used most efficiently when combined with 'human intelligence', i.e., an experienced clinician. Furthermore, this emphasizes that the use of LLMs in clinical practice needs close supervision by the user.
End-of-life decisions are inherently difficult and require a high level of exclusively human qualities such as professional experience, compassion, emotions, and consciousness of cultural backgrounds and social inequalities. However, LLMs are solely machines that base decisions on stochastic principles without consciousness or emotions.
Although there is an increasing number of studies using LLMs in medicine, studies assessing the ability of LLMs to predict patient outcomes are scarce. In a small study including 30 emergency department patients, the ability of ChatGPT-3.5 and ChatGPT-4 to generate a meaningful differential diagnosis was comparable to that of medical experts. However, a potential association with outcomes was not assessed. 19 In a pre-print online publication investigating the performance of three LLMs (ChatGPT-3.5, ChatGPT-4, Bard) for the prediction of 10-year cardiovascular risk, the LLMs' performance was comparable to the Framingham score. 23 Although the performance of LLMs in predicting medical outcomes seems promising, important limitations need to be addressed. First, the predictive value does not significantly exceed that of known validated post-cardiac arrest scores. As the positive predictive values for mortality and/or poor neurological outcome are not satisfactory, clinicians should never base their decisions regarding withdrawal of life-sustaining therapies on single tests or scores. This is reflected in the clinical guidelines recommending a multimodal approach without the use of post-cardiac arrest scores.
Also, clinicians assessing LLMs should be aware of the 'stochastic parrot' principle proposed by Bender et al. 50 and emphasized by Boussen et al. 51 Due to the underlying algorithm, an LLM understands neither the input that is entered nor the output generated. It just rigidly repeats structures and patterns it has been trained on, including prejudices, stereotypes, and social inequalities. 52 This might be an explanation for the comparable but not significantly better performance of ChatGPT-4 in the prediction of mortality and neurological outcomes when compared to validated post-cardiac arrest scores. This aligns with other studies finding comparable, but not superior, performances in clinical or theoretical contexts. 17,53 Due to the algorithm behind LLMs, the user should be aware of a certain number of 'hallucinations' or illogical answers generated. Hallucinations are a well-known shortcoming of LLMs and are associated with the stochastic parrot principle. 50,51 However, in our study, the rate of hallucinations was relatively low, with a maximum of 14.0% per run. Nevertheless, ChatGPT-4's ability to detect illogical answers is limited and still warrants the presence of a human controller. 50,51,54 Additionally, ChatGPT-4 provided inconsistent answers in some patients, which we tried to account for by using the most frequent answer out of the three runs. However, this is a major limitation of the use of ChatGPT-4 in prognosticating outcomes after cardiac arrest.
The field of LLMs in medicine is expanding rapidly, as will the capabilities of LLMs. Hence, future research should focus on the evaluation of performance-enhancing plugins, which might have the ability to reduce the production of false results and/or references by checking the results against external databases such as PubMed. 55 Furthermore, specific training of healthcare professionals and transforming medical datasets into easily accessible and structured databases will be crucial to improving the value of LLMs for clinical questions, as recently shown in a study integrating an LLM in the clinical workflow. 28 Also, further specific training of the LLM is warranted to enable the LLM to perform significantly better than validated scores. However, specific training requires a training dataset, which can be difficult to obtain when patient data safety is considered.
Training an LLM with unstructured medical charts might involuntarily expose patients' identities or upload confidential data to a cloud-based LLM. In the present study, this issue was addressed by uploading anonymized and structured patient data. Furthermore, training data must be well chosen and representative for the training purpose, as otherwise real-world bias might be reproduced by the LLM.
At the time of prompting, ChatGPT-4 was designed to answer queries based on its training data only, and its knowledge did not extend beyond September 2021. Furthermore, the 'black box problem', describing the current lack of understanding of the underlying algorithm and its method of solving, remains an issue. This is in line with the recently published expert opinion 56 that we need to ensure that these models are safe and effective through vigorous testing, uncovering possible biases, and thereby enabling a correction and training of the models. 54 Future research should focus on the direct integration of LLMs into clinical information systems, which could substantially decrease the administrative workload for physicians, allowing a focus on patient care as a historic core competence. However, concerns regarding data privacy will be significant.

Strengths and limitations
To the best of our knowledge, this is the first study assessing the prognostication of outcomes after cardiac arrest by an LLM using real-world data. A pragmatic approach aiming at high reproducibility and data integrity using an established post-cardiac arrest database was used. However, the present study also has several limitations.
First, the parameters used for prognostication were also available to the clinicians involved in WLST. Hence, there might be a certain risk of self-fulfilling prophecies. 57,58 In addition, the studies the LLM has been exposed to might also have been influenced by self-fulfilling prophecies. Hence, one cannot be sure to what extent the LLM can predict true outcomes or just reproduces the self-fulfilling prophecies present in studies the LLM has been exposed to.
Second, due to the algorithm behind LLMs, the user should be aware of a certain number of hallucinations or illogical answers generated.
Third, as ChatGPT-4 was not designed for healthcare purposes, its applicability and validity to answer specific clinical questions remains unclear and warrants further research. Fourth, our study is based on a single-center cohort, limiting its generalizability to other centers or regions and emphasizing the importance of future research in diverse contexts to enhance the external validity of the results.

Conclusions
ChatGPT-4 showed a good performance in predicting mortality and poor neurological outcome comparable to validated post-cardiac arrest scores and thus may be a helpful future tool for early risk prediction in adult cardiac arrest patients. However, due to frequent hallucinations in the output data, ChatGPT-4 still needs human supervision. Also, training a specific future LLM needs structured medical data sets, and future research should focus on validation of LLMs in various clinical settings.

Fig. 1 - Comparison of ROC curves for mortality at hospital discharge. Abbreviations: AUROC Area under the receiver operating characteristics curve; CAHP Cardiac arrest hospital prognosis; ChatGPT-4 Chat Generative Pre-Trained Transformer 4; OHCA Out-of-hospital cardiac arrest; PROLOGUE Prognostication using logistic regression model for unselected adult cardiac arrest patients in the early stages.

Fig. 2 - Comparison of ROC curves for poor neurological outcome at hospital discharge (Cerebral Performance Category Scale 3-5 including death). Abbreviations: AUROC Area under the receiver operating characteristics curve; CAHP Cardiac arrest hospital prognosis; ChatGPT-4 Chat Generative Pre-Trained Transformer 4; OHCA Out-of-hospital cardiac arrest; PROLOGUE Prognostication using logistic regression model for unselected adult cardiac arrest patients in the early stages.

Table 1 - Baseline characteristics of the study population stratified according to the primary outcome (in-hospital mortality). Abbreviations: COPD Chronic obstructive pulmonary disease; CPR Cardiopulmonary resuscitation; ICU Intensive care unit; ROSC Return of spontaneous circulation; SD Standard deviation.

Table 2 - Performance of ChatGPT-4 for the prediction of in-hospital mortality and poor neurological outcome at hospital discharge (Cerebral Performance Category Scale 3-5 including death). Abbreviations: ChatGPT-4 Chat Generative Pre-Trained Transformer 4; CI Confidence interval.